59 research outputs found

The Design of Eman, an Experiment Manager

Author: Bojar Ondřej
Tamchyna Aleš
Publication date: 01/01/2013

We present eman, a tool for managing large numbers of computational experiments. Over the years of our research in machine translation (MT), we have collected a couple of ideas for efficient experimenting. We believe these ideas are generally applicable in (computational) research of any field. We incorporated them into eman in order to make them available in a command-line Unix environment. The aim of this article is to highlight the core of the many ideas. We hope the text can serve as a collection of experiment management tips and tricks for anyone, regardless of their field of study or the computer platform they use. The specific examples we provide in eman’s current syntax are less important, but they allow us to use concrete terms. The article thus also fills the gap in eman documentation by providing a high-level overview.

CiteSeerX

Biblio at Institute of Formal and Applied Linguistics

Modeling Target-Side Inflection in Neural Machine Translation

Author: Fraser Alexander
Weller-Di Marco Marion
Tamchyna Aleš
Publication date: 01/01/2017

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization over the rich vocabulary in morphologically rich languages with strong inflectional phenomena. We introduce a simple approach to overcome this problem by training a system to produce the lemma of a word and its morphologically rich POS tag, which is then followed by a deterministic generation step. We apply this strategy for English-Czech and English-German translation scenarios, obtaining improvements in both settings. We furthermore show that the improvement is not due to only adding explicit morphological information. Comment: Accepted as a research paper at WMT17. (Updated version with corrected references.)

arXiv.org e-Print Archive

Crossref
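
The two-step idea in the abstract above (predict a lemma together with a morphological tag, then generate the inflected word deterministically) can be illustrated with a toy lookup. This is only a sketch under stated assumptions: the tag strings, the tiny `FORMS` table, and the `generate_surface` function are hypothetical and are not taken from the paper, whose generation step may work quite differently.

```python
# Toy illustration of deterministic generation from (lemma, tag) pairs.
# The tag inventory and the FORMS table are invented for this example.
from typing import Dict, List, Tuple

# Hypothetical morphological lexicon: (lemma, tag) -> inflected surface form.
FORMS: Dict[Tuple[str, str], str] = {
    ("Haus", "NN.Nom.Sg"): "Haus",
    ("Haus", "NN.Dat.Pl"): "Häusern",
    ("klein", "ADJ.Dat.Pl"): "kleinen",
    ("in", "PREP"): "in",
}


def generate_surface(pairs: List[Tuple[str, str]]) -> List[str]:
    """Look up each (lemma, tag) pair; fall back to the bare lemma if unknown."""
    return [FORMS.get((lemma, tag), lemma) for lemma, tag in pairs]


# Pretend the translation system predicted these lemma/tag pairs.
predicted = [("in", "PREP"), ("klein", "ADJ.Dat.Pl"), ("Haus", "NN.Dat.Pl")]
surface = generate_surface(predicted)
print(" ".join(surface))  # prints: in kleinen Häusern
```

Because the system only has to predict `Haus` plus a tag, all inflected forms of the word share one lemma in the vocabulary, which is the kind of generalization the abstract says BPE alone does not provide.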

Target-Side Context for Discriminative Models in Statistical Machine Translation

Author: Bojar Ondřej
Fraser Alexander
Junczys-Dowmunt Marcin
Tamchyna Aleš
Publication date: 01/01/2016

Discriminative translation models utilizing source context have been shown to help statistical machine translation performance. We propose a novel extension of this work using target context information. Surprisingly, we show that this model can be efficiently integrated directly in the decoding process. Our approach scales to large training data sizes and results in consistent improvements in translation quality on four language pairs. We also provide an analysis comparing the strengths of the baseline source-context model with our extended source-context and target-context model and we show that our extension allows us to better capture morphological coherence. Our work is freely available as part of Moses. Comment: Accepted as a long paper for ACL 201

arXiv.org e-Print Archive

Crossref

Biblio at Institute of Formal and Applied Linguistics

Improving Evaluation of English-Czech MT through Paraphrasing

Author: Barančíková Petra
Rosa Rudolf
Tamchyna Aleš
Publication date: 01/01/2014

In this paper, we present a method of improving the accuracy of machine translation evaluation of Czech sentences. Given a reference sentence, our algorithm transforms it by targeted paraphrasing into a new synthetic reference sentence that is closer in wording to the machine translation output, but at the same time preserves the meaning of the original reference sentence. Grammatical correctness of the new reference sentence is provided by applying Depfix on newly created paraphrases. Depfix is a system for post-editing English-to-Czech machine translation outputs. We adjusted it to fix the errors in paraphrased sentences. Due to a noisy source of our paraphrases, we experiment with adding word alignment. However, the alignment reduces the number of paraphrases found and the best results were achieved by a simple greedy method with only one-word paraphrases thanks to their intensive filtering. BLEU scores computed using these new reference sentences show significantly higher correlation with human judgment than scores computed on the original reference sentences.

Biblio at Institute of Formal and Applied Linguistics
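
The greedy one-word paraphrasing mentioned in the abstract above can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' algorithm: the `PARAPHRASES` table, the `paraphrase_reference` function, and the English example sentences are all hypothetical, and the Depfix correction and paraphrase filtering described in the abstract are left out.

```python
# Greedy one-word paraphrasing of a reference toward an MT hypothesis.
# Hypothetical sketch; the paraphrase table and sentences are invented.
from typing import Dict, List, Set

# Hypothetical one-word paraphrase table (English used for readability).
PARAPHRASES: Dict[str, Set[str]] = {
    "big": {"large", "huge"},
    "purchased": {"bought"},
}


def paraphrase_reference(reference: List[str], hypothesis: List[str],
                         table: Dict[str, Set[str]]) -> List[str]:
    """Replace reference words with paraphrases that occur in the hypothesis."""
    hyp_words = set(hypothesis)
    new_reference = []
    for word in reference:
        if word in hyp_words:
            # Already matches the MT output; keep it unchanged.
            new_reference.append(word)
            continue
        # Paraphrases of this word that the MT output actually used.
        candidates = sorted(table.get(word, set()) & hyp_words)
        new_reference.append(candidates[0] if candidates else word)
    return new_reference


reference = "he purchased a big house".split()
hypothesis = "he bought a large home".split()
new_ref = paraphrase_reference(reference, hypothesis, PARAPHRASES)
print(" ".join(new_ref))  # prints: he bought a large house
```

The new reference overlaps more with the hypothesis (`bought`, `large`) while `house` stays untouched because the table offers no paraphrase for it, so an n-gram metric such as BLEU would credit the legitimate wording choices in the MT output.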

MTMonkey: A Scalable Infrastructure for a Machine Translation Web Service

Author: Dušek Ondřej
Pecina Pavel
Rosa Rudolf
Tamchyna Aleš
Publication date: 01/01/2013

We present a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre- and post-processing. It is currently used to provide MT between several languages for cross-lingual information retrieval in the Khresmoi project. The software consists of an application server and remote workers which handle text processing and communicate translation requests to MT systems. The communication between the application server and the workers is based on the XML-RPC protocol. We present the overall design of the software and test results which document the speed and scalability of our solution. Our software is licensed under the Apache 2.0 licence and is available for download from the Lindat-Clarin repository and GitHub.

CiteSeerX

Biblio at Institute of Formal and Applied Linguistics

CUNI in WMT14: Chimera Still Awaits Bellerophon

Author: Bojar Ondřej
Popel Martin
Rosa Rudolf
Tamchyna Aleš
Publication date: 01/01/2014

We present our English→Czech and English→Hindi submissions for this year’s WMT translation task. For English→Czech, we build upon last year’s CHIMERA and evaluate several setups. English→Hindi is a new language pair for this year. We experimented with reverse self-training to acquire more (synthetic) parallel data and with modeling target-side morphology.

Biblio at Institute of Formal and Applied Linguistics

HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation

Author: Bojar Ondřej
Diatka Vojtěch
Rychlý Pavel
Straňák Pavel
Suchomel Vít
Tamchyna Aleš
Zeman Daniel
Publication date: 01/01/2014

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.

Biblio at Institute of Formal and Applied Linguistics

Integrating a Discriminative Classifier into Phrase-based and Hierarchical Decoding

Author: Braune Fabienne
Carpuat Marine
Daumé III Hal
Fraser Alexander
Quirk Chris
Tamchyna Aleš
Publication date: 01/01/2014

Current state-of-the-art statistical machine translation (SMT) relies on simple feature functions which make independence assumptions at the level of phrases or CFG rules. However, it is well known that discriminative models can benefit from rich features extracted from the source sentence context outside of the applied phrase or CFG rule, which is available at decoding time. We present a framework for the open-source decoder Moses that allows discriminative models over source context to be easily trained on a large number of examples and then included as feature functions in decoding.

CiteSeerX

Biblio at Institute of Formal and Applied Linguistics

Machine Translation of Medical Texts in the Khresmoi Project

Author: Dušek Ondřej
Hajič Jan
Hlaváčová Jaroslava
Novák Michal
Pecina Pavel
Rosa Rudolf
Tamchyna Aleš
Urešová Zdeňka
Zeman Daniel
Publication date: 01/01/2014

The WMT 2014 Medical Translation Task poses an interesting challenge for Machine Translation (MT). In the standard translation task, the end application is the translation itself. In this task, the MT system is considered a part of a larger system for cross-lingual information retrieval (IR).

Crossref

Biblio at Institute of Formal and Applied Linguistics

Adaptation of machine translation for multilingual information retrieval in the medical domain

Author: Dušek Ondřej
Goeuriot Lorraine
Hajič Jan
Hlaváčová Jaroslava
Jones Gareth J.F.
Kelly Liadh
Leveling Johannes
Mareček David
Novák Michal
Pecina Pavel
Popel Martin
Rosa Rudolf
Tamchyna Aleš
Urešová Zdeňka
Publication venue: 'Elsevier BV'
Publication date: 01/01/2014

Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.

Crossref

Hal - Université Grenoble Alpes

Irish Universities

DCU Online Research Access Service

Biblio at Institute of Formal and Applied Linguistics